Red Wine Quality Exploration by Fen Li

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The summary table suggests that several variables may have outliers: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates. For residual.sugar, total.sulfur.dioxide and chlorides in particular, the maximum values lie far above the 3rd quartile.
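The "far above the 3rd quartile" impression can be checked with a common rule of thumb. The sketch below is not the original analysis code; it only reuses the quartiles and maxima printed in the summary table and applies Tukey's Q3 + 1.5 * IQR fence. The data frame name `stats` and its column names are made up for this illustration.

```r
# Quartiles and maxima copied from the summary table above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "total.sulfur.dioxide"),
  q1  = c(1.9, 0.070, 22),
  q3  = c(2.6, 0.090, 62),
  max = c(15.5, 0.611, 289)
)

# Tukey's rule of thumb: anything above Q3 + 1.5 * IQR is a potential outlier.
stats$upper_fence <- stats$q3 + 1.5 * (stats$q3 - stats$q1)
stats$max_beyond_fence <- stats$max > stats$upper_fence

print(stats)
```

For all three variables the maximum sits well beyond the fence, which is consistent with the reading of the table given above.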

Univariate Plots Section

We can see that most of the red wines in our dataset are rated 5 or 6.

The fixed.acidity and volatile.acidity variables look roughly normally distributed. However, citric.acid is strongly right skewed, and applying a log or sqrt transform does not change that much.

The residual.sugar and chlorides variables look normally distributed, apart from some outliers in both of them.

Both free.sulfur.dioxide and total.sulfur.dioxide are skewed to the right and have some outliers. After applying a log transform, they look approximately normally distributed.

The density, pH and sulphates variables look normally distributed. The variance of density is very small: most values fall between 0.993 and 1.

The alcohol variable is right skewed, and applying sqrt and log transforms does not change it much.
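As a rough illustration of what the log transform does to a right-skewed variable, the sketch below compares a simple skewness measure before and after `log10()`. It is an assumption-laden toy: it uses only the first ten total.sulfur.dioxide values visible in the `str()` output, and `skewness()` is a hypothetical helper written for this example, not a function from the original analysis.

```r
# First ten total.sulfur.dioxide values, taken from the str() output above.
# Ten rows only illustrate the mechanics; they do not reproduce the plots.
tsd <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Moment-based sample skewness (hypothetical helper, not from the original code).
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(tsd)
skew_log <- skewness(log10(tsd))

cat("skewness, raw scale:  ", round(skew_raw, 2), "\n")
cat("skewness, log10 scale:", round(skew_log, 2), "\n")
```

A skewness closer to zero after the transform is what "looks more normally distributed" means in numeric terms. On the full dataset the same comparison would be run on the whole column.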

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations with 12 features (the 13th column, X, is only a row index). The quality variable is discrete and the other variables are continuous.
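The feature count can be reconciled with the 13 variables reported by `str()` in a few lines. This is only a bookkeeping sketch using the column names listed above; the vector names are chosen for the example.

```r
# Column names as listed in the str() output above.
columns <- c("X", "fixed.acidity", "volatile.acidity", "citric.acid",
             "residual.sugar", "chlorides", "free.sulfur.dioxide",
             "total.sulfur.dioxide", "density", "pH", "sulphates",
             "alcohol", "quality")

# X only numbers the rows, so it is not counted as a feature.
features <- setdiff(columns, "X")
chemical <- setdiff(features, "quality")

cat(length(columns), "columns,", length(features), "features,",
    length(chemical), "chemical attributes\n")
```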

What is/are the main feature(s) of interest in your dataset?

The main feature in the dataset is quality. I’d like to find out which features have an impact on the quality of red wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Features such as volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide and sulphates may be correlated with quality, based on the information in the doc file provided by the author of the dataset.

Did you create any new variables from existing variables in the dataset?

No new variables have been created so far.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I applied the log-transform to the right skewed variables including citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and alcohol to get better insights about their distributions.

Bivariate Plots Section

Since I want to check the correlations between attributes, and the scatter plot matrix alone is not clear enough for that, I’ll add a correlation matrix plot.

According to the matrix plot and the Spearman correlation coefficient matrix, we can see that:

  • The correlation coefficients of quality with alcohol, volatile acidity, citric acid and sulphates are 0.476, -0.391, 0.226 and 0.251 respectively, which means these variables have relatively higher correlations with quality than the other variables.
  • Besides the four variables mentioned above, some variables, including density, total sulfur dioxide, chlorides and fixed acidity, have lower correlation coefficients (smaller than 0.2 in absolute value) but may also be related to quality.
  • There are also some moderate correlations between variables other than quality: citric acid with fixed acidity, volatile acidity and pH; density with fixed acidity and alcohol; and pH with fixed acidity.
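The code behind the correlation matrix is not shown in this output. A minimal sketch of how such a matrix can be computed in base R is given below, assuming a data frame of numeric columns. Here `wine_head` is a hypothetical name holding only the first ten rows visible in the `str()` output, so the coefficients it prints will not match the full-data values quoted above.

```r
# First ten rows of four columns, copied from the str() output above.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Rank-based (Spearman) correlation matrix.
# With only ten rows the coefficients differ from those of the full data.
corr <- cor(wine_head, method = "spearman")
print(round(corr, 2))
```

Passing `method = "pearson"` instead gives the product-moment coefficients that `cor.test()` reports by default.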

So let’s dig deeper into this.

Before starting the following analysis, I’ll convert quality to a factor, since it takes only 6 discrete values.

Correlation with Quality
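The conversion itself is not shown here; one way to do it in base R is sketched below, using the first ten quality scores from the `str()` output. The variable names are illustrative. The second half shows a detail worth keeping in mind when a factor is later passed through `as.numeric()`.

```r
# First ten quality scores, taken from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Declare all six scores present in the full dataset as levels, so that
# scores missing from this small sample still appear (with a count of zero).
quality_f <- factor(quality, levels = 3:8)
print(table(quality_f))

# Caution: as.numeric() on a factor returns the level codes (1..6), not the scores.
codes  <- as.numeric(quality_f)
scores <- as.numeric(as.character(quality_f))
print(unique(scores - codes))
```

With levels 3 to 8, the codes are simply the scores shifted by a constant, so a Pearson correlation computed on `as.numeric(quality)` equals the one computed on the raw scores. That would no longer hold if an intermediate score were missing from the levels.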

Alcohol and Quality

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## [1] "Summary of alcohol:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

From the boxplots of alcohol by quality, red wines with higher quality scores appear to have a higher median alcohol content, at least if we only consider wines with a quality score above 5. There are also a lot of outliers among wines with a quality of 5, so it is difficult to describe the relationship between alcohol and quality from the boxplot alone. Combined with the scatter plot, however, we can clearly see a positive correlation between the two variables. The correlation is only moderate (r = 0.476, p-value < 0.001), but the very low p-value is strong evidence that it is not due to chance.

Volatile acidity and Quality

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## [1] "Summary of volatile.acidity:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

  • Based on the plots above, volatile acidity and quality have a negative correlation: the higher the volatile acidity, the lower the quality. Wines with quality of 7 and 8 have very close median and 3rd quartile volatile acidity.
  • The very low p-value is also strong evidence for the negative correlation (r = -0.391, p-value < 0.001).
  • Our observation is consistent with the description provided by the author, namely that volatile acidity at too high a level can lead to an unpleasant, vinegar taste.

Citric acid and Quality

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## [1] "Summary of citric.acid:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

  • Based on the plots above, quality is positively related to citric acid. But two observations are worth mentioning:
  • Wines with quality of 7 versus 8, and 5 versus 6, have very close median and 1st quartile values of citric acid.
  • The distribution of citric.acid is very dispersed, especially for quality under 6.
  • From the scatter plot, we can see a slightly increasing trend after adding the smooth line, which is consistent with the Pearson correlation (r = 0.226, p-value < 0.001). The r value indicates a relatively weak correlation, but the low p-value indicates that the correlation is significant.
  • The observation seems reasonable, because the author of the dataset mentions that citric acid is found in small quantities and can add ‘freshness’ and flavor to wines.

Sulphates and Quality

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## [1] "Summary of sulphates:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The correlation between quality and sulphates appears slightly positive, and there are many outliers. So I add xlim to the scatter plot to get a better view of the relationship.

  • In the scatter plot, there is an increasing trend when sulphates are under 0.9, and quality seems slightly negatively related to sulphates above 1.0.
  • The attribute description provided by the author says that sulphates are a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine, which makes our observation reasonable.
  • As in the previous analyses, the low p-value (r = 0.251, p-value < 0.001) indicates that the positive correlation is reliable.

Other Attributes with quality

For total sulfur dioxide, chlorides and fixed acidity, the relationships with quality are not strong enough to describe their effect on quality clearly. Only density shows a slightly negative correlation with quality. However, the variance of density is so small that I doubt it can be detected by our sense of taste. I guess the relationship appears because the density of wine depends on its alcohol and sugar content, as mentioned in the doc file provided by the author of the dataset.

Correlation between Attributes

We can see that pH is negatively related to citric.acid and fixed.acidity, which makes sense because a lower pH means a more acidic solution. In addition, citric.acid is positively correlated with fixed.acidity and negatively correlated with volatile.acidity.

The plots above indicate that density is positively related to fixed.acidity and negatively related to alcohol. This makes sense, since the densities of tartaric acid (fixed.acidity), water and alcohol are 1790 kg/m^3, 1000 kg/m^3 and 806 kg/m^3 respectively.
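To put the pH scale in perspective: pH is the negative base-10 logarithm of the hydrogen ion concentration, so each unit of pH corresponds to a tenfold change in acidity. The short sketch below applies that standard definition to the minimum and maximum pH from the summary table. The variable names are chosen for this example.

```r
# Minimum and maximum pH from the summary table above.
ph_min <- 2.74
ph_max <- 4.01

# pH = -log10([H+]), so a drop of d pH units multiplies [H+] by 10^d.
h_ratio <- 10^(ph_max - ph_min)
cat("The most acidic wine has about", round(h_ratio, 1),
    "times the hydrogen ion concentration of the least acidic one\n")
```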

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The relationships I observed include positive correlations of quality with alcohol, citric.acid and sulphates, and negative correlations of quality with volatile.acidity and density. I added a smoothing line to the scatter plots to help identify the relationships between quality and the other attributes of red wines. Most of the relationships do not look exactly linear.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

citric.acid is positively correlated with fixed.acidity and negatively correlated with volatile.acidity. I also find that pH is negatively related to citric.acid and fixed.acidity, and that density is positively related to fixed.acidity and negatively related to alcohol, all of which seem reasonable.

What was the strongest relationship you found?

The strongest relationship I found is that between quality and alcohol, which has the highest r value (0.476) compared to other correlations of attributes with quality. And the second one is between quality and volatile.acidity (r = -0.391). The last two are correlations of quality with sulphates (r = 0.251) and citric.acid (r = 0.226).
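The t statistics in the `cor.test()` outputs above follow directly from r and the degrees of freedom, which also shows why even the weaker correlations come with tiny p-values at this sample size. The sketch below recomputes them from the printed estimates, using the standard formula t = r * sqrt(df) / sqrt(1 - r^2). The vector names are illustrative.

```r
# Correlation estimates copied from the cor.test() outputs above.
r  <- c(alcohol = 0.4761663, volatile.acidity = -0.3905578,
        sulphates = 0.2513971, citric.acid = 0.2263725)
df <- 1597  # n - 2, with n = 1599 observations

# Test statistic for H0: true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Share of the variance in quality that a straight-line fit
# on each attribute alone would account for.
r_squared <- r^2

print(round(cbind(t = t_stat, r_squared = r_squared), 3))
```

The recomputed t values agree with the printed ones. The r_squared column is a reminder that a very low p-value says the correlation is unlikely to be zero, not that it is strong.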

Multivariate Plots Section

Based on the plot above, wines with quality of 7 and 8 are mostly located in the bottom-right part of the plot compared to points with quality of 3 and 4. That means high-quality wines have relatively higher citric acid and lower volatile acidity.

According to the scatter plot of quality by alcohol and volatile.acidity, points with the same quality are less dispersed in the horizontal dimension than in the first multivariate plot in this section. This is unsurprising, since alcohol has a stronger correlation with quality than citric acid does.

At this point, we get a relatively clearer scatter plot. We can see that the points with quality 7 and 8 are mostly located in the upper right of the plot, and most points with quality 3, 4 and 5 are in the left part. Points with the same quality are also less dispersed horizontally than vertically. So once again, this supports the finding that alcohol has the strongest correlation with quality.

I combine three of the attributes most strongly correlated with quality in this plot and then remove the points with a moderate quality of 5 and 6 to get a clearer view of the effect of each attribute on wine quality. The plots support the analyses from the previous part: wines with high quality scores have lower volatile acidity and higher alcohol and citric acid content.
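Dropping the middle scores before plotting is a one-line filter. The sketch below shows one way to do it in base R on the first ten rows visible in the `str()` output. `wine_head` and `extremes` are hypothetical names, and the original analysis may have used a different approach.

```r
# First ten rows of the relevant columns, copied from the str() output above.
wine_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Keep only wines whose score is not in the crowded middle (5 or 6).
extremes <- subset(wine_head, !(quality %in% c(5, 6)))
print(extremes)
```

In this ten-row sample only the two wines rated 7 survive the filter. On the full dataset the result would contain the wines rated 3, 4, 7 and 8.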

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Three of the features most correlated with quality (alcohol, citric acid and volatile acidity) strengthen each other in our multivariate scatter plots. In short, high-quality wines have relatively higher alcohol and citric acid content and relatively lower volatile acidity.

Were there any interesting or surprising interactions between features?

According to the scatter plots in this section, points are more dispersed along the citric acid dimension than along the other two (alcohol and volatile acidity). This seems reasonable since, according to the author, citric acid is found in small quantities and can add ‘freshness’ and flavor to wines.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Since there are many outliers in the dataset and the strongest correlation (alcohol with quality) has an r value under 0.5, I could not find a very precise model to predict the quality of wines.


Final Plots and Summary

Plot One

Description One

The quality of each wine is rated by at least 3 wine experts on a scale from 0 (very bad) to 10 (very excellent), and only 6 discrete values (3, 4, 5, 6, 7, 8) appear in our dataset. Most of the scores are 5 or 6, which account for 42.6% and 39.9% of the dataset respectively. This means most of the wines are moderate; bad wines (quality 3) and excellent wines (quality 8) account for only 0.63% and 1.1% of our dataset.
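The percentages quoted in this description can be turned back into approximate counts, which makes the imbalance between the score groups more concrete. This is a small arithmetic sketch based only on the shares stated above and the 1599 observations. The names `share` and `approx_count` are made up for the example.

```r
n <- 1599  # number of observations, from the str() output

# Shares quoted in the description above, keyed by quality score.
share <- c("3" = 0.0063, "5" = 0.426, "6" = 0.399, "8" = 0.011)

# Approximate number of wines behind each percentage.
approx_count <- round(share * n)
print(approx_count)

# Combined share of the two middle scores.
cat("Scores 5 and 6 together:", 100 * (share[["5"]] + share[["6"]]), "%\n")
```

On the full data frame, shares like these would typically come from `prop.table(table(quality))`.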

Plot Two

Description Two

The second plot includes three boxplots of attributes (alcohol, volatile acidity and citric acid) by quality. These are three of the attributes most strongly correlated with quality. It is easy to see that quality is negatively related to volatile acidity and positively related to alcohol and citric acid. If we focus on the median values of the boxes, we find that the relationships of these three attributes with quality are not exactly linear.

Plot Three

Description Three

I put three of the attributes most strongly correlated with quality in the third plot, which gives clearer insight into the relationships between these attributes. In the left scatter plot there are so many points with quality of 5 and 6 that they may obscure the relationship between attributes, so I add a plot which keeps only points with relatively extreme quality scores (3, 4, 7, 8) to get a better view. With these two plots, it is easy to see that most points with quality of 7 and 8 are in the upper-right area of the plots and have a smaller point size, while the points with quality of 3 and 4 show the opposite pattern. That is, high-quality red wines have relatively higher alcohol and citric acid content and lower volatile acidity, and bad wines show the opposite.


Reflection

The red wine dataset includes 1599 observations with 11 attributes on the chemical properties of the wine, plus the quality of the wine, which is rated by at least three wine experts between 0 (very bad) and 10 (very excellent). So quality is a subjective variable and the other 11 are objective. What interests me most about this dataset is finding which attributes of red wines have an effect on their quality. I followed the guidance provided in the template to start my exploration.

I have to say that it was difficult for me to start the EDA process for the red wines dataset. It is not like the diamonds dataset in the lessons, which we are very familiar with and which has relatively obvious relationships between variables, so that even before starting the analysis we can pick out potentially related variables by intuition and experience. The red wines dataset, however, contains many chemical variables I’m not familiar with. So I took a close look at the doc file provided by the author of the dataset before going deep into the analysis. This helped a lot and gave me some tips on which attributes might need more attention.

Another problem I faced during the bivariate and multivariate analysis is that the correlations of attributes like alcohol, volatile acidity and citric acid with quality are only moderate to weak. These variables are also correlated with each other, so it’s hard to identify the fundamental factors that actually affect the quality of wines based on the given dataset.

Last, I did not create any model to predict the quality of red wines in my EDA process. One reason, as mentioned above, is that the correlations between variables are not very strong. The second is that there are no records of wines with quality under 3 or above 8. If we can get a more complete dataset of red wines in the future, it should be easier to create a good model to predict wine quality.